Experiments in Data Compression II
(or "Why couldn't I have thought of this sooner?")

by John Kormylo
E.L.F. Software


State of the Art

The following table shows the results of compressing three different files using three different methods.  'ZIP' refers to STZIP 2.5 (deflating).  'OLD' refers to the improved LZA method documented in the previous report, using a single 16K search tree.  'BEST' refers to the best results obtained previously, using three search trees and testing all possible ways to compress the next 65 characters.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 ZIP   |        25207 |       32444 |       496880 |
 OLD   |        24942 |       34003 |       670774 |
 BEST  |        22196 |       32035 |       455017 |
-------+--------------+-------------+--------------+


Redundancy Weighting

The main inefficiency of the old LZA method is that while there can be many matches of a given length, only one can be used at a time.  No matter how heavily the entry used is weighted, there is still lost information in the other possible matches.

Using a hash table instead of a search tree could solve this problem, but hash tables are inefficient for storage (a 16K binary search tree with a maximum match length of 65 characters would require up to 1,054,960 entries in a hash table).  Also, once something is put into a hash table it can never be taken out again (at least, using the implementation documented in "The Data Compression Book" by Mark Nelson).

However, since the binary search tree has already alphabetized the "dictionary" of possible matches, it is relatively easy to use an index corresponding to an entry's position in alphabetical order.  This can be done easily by including in the node structure a count of how many nodes are in that branch (actually, the self-balancing functions already require this).  More importantly, it is also easy to locate the largest and smallest indexes corresponding to a match of a given length (and certainly faster than searching for the most heavily weighted match).  The 'weight' for this 'match' is therefore the number of matches of this length in the search tree (as it turns out, the arithmetic compression algorithm normally implements weights as a range of index values).  The length (and match indicator) codes are implemented using character codes 256 through 320, as before.

When expanding the data, one will get some value in the range of possible matches.  One would then search through the binary tree to locate the specific dictionary entry.  Given the text pointer from this node, one can then go through the search tree again and locate the largest and smallest matches for this length.

Obviously, this approach is more efficient than any possible weighting scheme involving single dictionary entries.  Also, this approach is relatively insensitive to dictionary size, since the number of matches should increase proportionately as the number of entries increases (assuming a uniformly random distribution of matches).  Other advantages are that the node structure does not need to store the weight, last match length or sort index, and that one need not sort the indexes into weight order.

The main disadvantage is that the expansion algorithm must also sort the entries into a binary search tree.  Not only does this make the expansion slower, it also requires that the compression algorithm not use information not yet coded when constructing the search tree.
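As a rough sketch of the subtree-count bookkeeping described above (hypothetical names, an illustration of the idea rather than the actual E.L.F. Software code), the node structure and the index lookup might look like this:

    /* Sketch of a search tree node carrying a count of the nodes in its
     * branch, and of how that count gives an entry's index in alphabetical
     * order.  The 'count' field is assumed to be maintained by the insertion
     * and self-balancing routines.  Hypothetical names throughout. */
    typedef struct node {
        struct node *left, *right;
        long         textptr;   /* where this dictionary entry starts in the text */
        int          count;     /* nodes in this branch, including this one */
    } NODE;

    /* Zero-based position of 'target' in alphabetical order, i.e. how many
     * entries compare less than it.  'cmp' orders two entries by the text
     * they point at. */
    static int tree_index(const NODE *root, const NODE *target,
                          int (*cmp)(const NODE *, const NODE *))
    {
        int index = 0;

        while (root != NULL) {
            int c = cmp(target, root);

            if (c < 0) {                    /* target lies in the left branch */
                root = root->left;
            } else {
                if (root->left != NULL)     /* everything on the left is smaller */
                    index += root->left->count;
                if (c == 0)
                    return index;           /* found the entry itself */
                index++;                    /* the root is smaller too */
                root = root->right;
            }
        }
        return index;                       /* where the entry would be inserted */
    }

Locating the leftmost and rightmost entries that still match the current text for the given length gives the smallest and largest such indexes; their difference plus one is the weight, which the arithmetic coder can use directly as a range of index values.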
Normally, the compression algorithm uses all 65 (or whatever the maximum match length is) future characters to place a text pointer into the tree.  Needless to say, the expansion algorithm does not have access to these future characters.

Another disadvantage is that one cannot use this approach to match future characters.  When the old LZA method ran into a sequence of repeated characters (not yet in the dictionary), it would code the first normally and code the rest by matching the text pointer just entered into the tree.  The expansion algorithm would recursively generate the entire sequence:

    for(i=0; i ...

  size  | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
--------+--------------+-------------+--------------+
        |              |             |       536383 |
    128 |        38167 |       41574 |       538023 |
    256 |        34187 |       39331 |       548685 |
    512 |        31076 |       37480 |       564429 |
   1024 |        27844 |       36567 |       581900 |
   2048 |        25690 |       35234 |       601221 |
   4096 |        24088 |       34728 |       624977 |
   8192 |        23235 |     > 34571 |       644742 |
  16384 |      > 23043 |       34693 |       660107 |
--------+--------------+-------------+--------------+


FIFO Buffer

An alternative solution is not to add match locations to the search tree until all 65 characters have been coded.  This can be implemented using a 64 entry FIFO (First In/First Out) buffer to hold these new locations.  In fact, one can use this buffer as an alternative search tree, similar to the multiple search tree methods explored in the previous report.  This should also improve performance on files with locally clustered matches.

The following table shows the results for this approach using a 64 entry tree (unweighted) and a 16K entry tree (weighted by the number of matches).  Again, using the longest match is labeled 'LONG', using the fewest bits/character is labeled 'AVG', and using the (nearly) optimal selections is labeled 'BEST'.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24618 |       33418 |       597383 |
 AVG   |        23991 |       33269 |       555619 |
 BEST  |        21893 |       32147 |       496517 |
-------+--------------+-------------+--------------+

Both the LONG and AVG algorithms out-performed the OLD algorithm on all three files.  However, the new BEST algorithm performed better on only one of the files.

It should be noted that entries in the FIFO buffer can be flushed to the main dictionary as soon as all 65 bytes are transmitted, effectively reducing the size of the FIFO buffer.  However, as shown in the next table, this doesn't buy you much.  In fact, it is worse on two of the three files:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24630 |       33544 |       599469 |
 AVG   |        24082 |       33466 |       540713 |
 BEST  |        22034 |       32345 |       473610 |
-------+--------------+-------------+--------------+

On the other hand, increasing the FIFO buffer to 128 entries doesn't help either.


FIFO Weighting

The reason the FIFO buffer approach works better than backwards sorting is that the statistical assumption of a uniform random distribution of matches is not valid for some files.  In this case the probability of finding a match in the 64 entry FIFO buffer is much greater than the number of equivalent matches in the main dictionary divided by the total number of entries.

That being the case, it would also seem reasonable that the probability of finding a match in the most recent half of the FIFO buffer is greater than in the older half, and so on.  One could assign a weight to each entry in the FIFO buffer based on how long it has been there, and use those weights when coding the index.  These weights would, of course, consist of (scaled) counts of how many times this 'lag' has been used (initially set to 1).
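As a rough illustration of this lag weighting (hypothetical names, not the actual E.L.F. Software code), the counts might be kept and used along the following lines:

    /* Sketch of lag weighting for the 64 entry FIFO buffer.  lag_count[k]
     * counts how often a match has been coded at lag k (1 = most recently
     * added entry); all counts start at 1 so every lag keeps a nonzero
     * probability.  Hypothetical names throughout. */
    #define FIFO_SIZE 64

    static long lag_count[FIFO_SIZE + 1];   /* entries 1..FIFO_SIZE used */
    static long lag_total;

    void init_lag_weights(void)
    {
        int k;
        for (k = 1; k <= FIFO_SIZE; k++)
            lag_count[k] = 1;
        lag_total = FIFO_SIZE;
    }

    /* Range handed to the arithmetic coder for a match at 'lag':
     * [*low, *low + *range) out of *total index values. */
    void lag_range(int lag, long *low, long *range, long *total)
    {
        int k;
        long sum = 0;
        for (k = 1; k < lag; k++)
            sum += lag_count[k];
        *low   = sum;
        *range = lag_count[lag];
        *total = lag_total;
    }

    /* After coding a match at 'lag', bump its count; halve the counts now
     * and then so they stay 'scaled'.  The rescaling threshold here is an
     * arbitrary choice for the sketch. */
    void update_lag(int lag)
    {
        lag_count[lag]++;
        if (++lag_total > 16384) {
            int k;
            lag_total = 0;
            for (k = 1; k <= FIFO_SIZE; k++) {
                lag_count[k] = (lag_count[k] + 1) / 2;   /* never below 1 */
                lag_total += lag_count[k];
            }
        }
    }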
Since there are only 64 entries, this should not impose much additional computation cost.  The next table shows the results of using these weights for the FIFO buffer:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24620 |       33375 |       583218 |
 AVG   |        23991 |       33207 |       530798 |
 BEST  |        22022 |       32088 |       461457 |
-------+--------------+-------------+--------------+

As expected, these results were not quite as good for DEMONSTR.DOC, but better for the other two files.  A list of the weights generated using the 'AVG' approach is shown in the Appendix.  (It should be pointed out that this is not, strictly speaking, a 'lag' in the time series sense, since each increment can mean anywhere from 1 to 65 bytes.)


Break Even

Another advantage of redundancy weighting is that it does a much better job for smaller matches.  (One could theoretically use it for single characters, except that you would still need some way to get the dictionary started.)  However, early testing indicated that the best results were obtained by never using matches of length 2 and always using matches of length 3 (as opposed to testing for break even, as was done in the 'OLD' LZA algorithm; a sketch of this sort of test follows the Appendix).  The next table shows the results of using length 2 matches with FIFO weighting:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24860 |       34719 |       585185 |
 AVG   |        24054 |       34394 |       520875 |
 BEST  |        22029 |       31997 |       453143 |
-------+--------------+-------------+--------------+

Comparing this to the previous table, we see that including length 2 matches only works well with the 'BEST' algorithm (which implicitly contains a break even test).  More importantly, this allows the new 'BEST' to beat the old 'BEST' (in the first table) on all three files.


Appendix - Normalized Weights (count / total * 1000)

  lag  | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
     1 |            3 |          21 |           12 |
     2 |            5 |          63 |          251 |
     3 |           14 |          23 |          109 |
     4 |            8 |          66 |           88 |
     5 |           18 |          33 |           59 |
     6 |           24 |          37 |           48 |
     7 |           13 |          24 |           35 |
     8 |           19 |          42 |           31 |
     9 |           21 |          28 |           26 |
    10 |            8 |          35 |           23 |
    11 |           21 |          22 |           19 |
    12 |           20 |          26 |           16 |
    13 |           33 |          20 |           15 |
    14 |           24 |          24 |           13 |
    15 |           21 |          18 |           13 |
    16 |           19 |          17 |           12 |
    17 |           26 |          18 |           10 |
    18 |           25 |          16 |           10 |
    19 |            9 |          18 |           10 |
    20 |           25 |          15 |            9 |
    21 |           15 |          18 |            8 |
    22 |           17 |          17 |            8 |
    23 |           14 |          18 |            7 |
    24 |           28 |          15 |            7 |
    25 |           22 |          13 |            7 |
    26 |           17 |          13 |            6 |
    27 |           23 |          20 |            6 |
    28 |           16 |          15 |            6 |
    29 |           17 |          14 |            6 |
    30 |           17 |           9 |            5 |
    31 |           18 |          14 |            6 |
    32 |           14 |          11 |            5 |
    33 |           34 |          11 |            5 |
    34 |           12 |           8 |            5 |
    35 |           26 |          12 |            5 |
    36 |           13 |          10 |            4 |
    37 |           17 |           9 |            5 |
    38 |           17 |          13 |            5 |
    39 |           20 |           7 |            4 |
    40 |           13 |          10 |            4 |
    41 |           16 |          10 |            4 |
    42 |           11 |           9 |            4 |
    43 |            9 |           7 |            4 |
    44 |            9 |           8 |            4 |
    45 |           15 |           3 |            4 |
    46 |           13 |           4 |            3 |
    47 |           18 |           8 |            3 |
    48 |            7 |           9 |            3 |
    49 |           14 |           5 |            3 |
    50 |            9 |           9 |            3 |
    51 |           15 |           9 |            3 |
    52 |           14 |           5 |            3 |
    53 |           20 |           8 |            3 |
    54 |            9 |           6 |            3 |
    55 |            8 |           6 |            3 |
    56 |            8 |           9 |            3 |
    57 |           13 |           5 |            3 |
    58 |           10 |          11 |            3 |
    59 |            4 |          13 |            3 |
    60 |           15 |           9 |            2 |
    61 |           12 |           4 |            2 |
    62 |            7 |           9 |            3 |
    63 |            8 |           8 |            3 |
    64 |            9 |           7 |            2 |
-------+--------------+-------------+--------------+
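Finally, here is a sketch of the kind of break even test referred to in the Break Even section (hypothetical names and placeholder bit costs, not the actual 'OLD' LZA code):

    /* Sketch of a break even test for a candidate match.  The bit costs
     * below are crude placeholders; a real coder would take them from its
     * current counts.  Hypothetical names throughout. */
    static double bits_for_char(void)   { return 8.0; }  /* one literal character */
    static double bits_for_length(void) { return 9.0; }  /* length/match code     */
    static double bits_for_index(void)  { return 6.0; }  /* e.g. a 64 entry FIFO  */

    /* Nonzero if coding 'length' characters as a single match is expected
     * to take fewer bits than coding them one character at a time. */
    static int worth_matching(int length)
    {
        double match_bits   = bits_for_length() + bits_for_index();
        double literal_bits = length * bits_for_char();
        return match_bits < literal_bits;
    }

In the 'BEST' algorithm no separate test of this kind is needed, since comparing complete codings of the next 65 characters already accounts for these costs.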